Connectivity in Bag Generation 



Arturo Trujillo and Simon Berry* 

School of Computer and Mathematical Sciences 
The Robert Gordon University, St Andrew Street 
Aberdeen AB25 1HG 
Scotland 
{iat,cs5sby}@scms.rgu.ac.uk 






Abstract 

This paper presents a pruning technique which can be used to reduce the
number of paths searched in rule-based bag generators of the type
proposed by (Poznański et al., 1995) and (Popowich, 1995). Pruning the
search space in these generators is important given the computational
cost of bag generation. The technique relies on a connectivity
constraint between the semantic indices associated with each lexical
sign in a bag. Testing the algorithm on a range of sentences shows
reductions in the generation time and the number of edges constructed.

1 Introduction 

Bag generation is a form of natural language generation in which the
input is a bag (also known as a multiset: a set in which repeated
elements are significant) of lexical elements and the output is a
grammatical sentence or a statistically most probable permutation with
respect to some language model.

Bag generation has been considered within the statistical and
rule-based paradigms of computational linguistics, and each has handled
this problem differently (Chen and Lee, 1994; Whitelock, 1994;
Popowich, 1995; Trujillo, 1995). This paper only considers rule-based
approaches to this problem.

Bag generation has received particular attention in lexicalist
approaches to MT, as exemplified by Shake-and-Bake generation (Beaven,
1992; Whitelock, 1994). One can also envisage applications of bag
generation to generation from minimally recursive semantic
representations (Copestake et al., 1995) and other semantic frameworks
which separate scoping from content information (Reyle, 1995). In these
frameworks, the unordered nature of predicate or relation sets makes
the application of bag generation techniques attractive.

A notational convention used in the paper is that items such as
'dog_1' stand for simplified lexical signs of the form (Shieber, 1986):

    [ CAT  = N
      RELN = dog
      ARG1 = 1 ]

In such signs, the semantic argument will be referred to as an 'index'
and will be shown as a subscript to a lexeme; in the above example, the
index has been given the unique type 1.

The term index is borrowed from HPSG (Pollard and Sag, 1994) where
indices are used as arguments to relations; however these indices may
also be equated with discourse referents in DRT (Kamp and Reyle, 1993).
As with most lexicalist generators, semantic variables must be
distinguished in order to disallow translationally incorrect
permutations of the target bag. We distinguish variables by uniquely
typing them.

Two assumptions are made regarding lexical-semantic indexing.

Assumption 1 All lexical signs must be indexed, including functional
and nonpredicative elements (Calder et al., 1989).

Assumption 2 All lexical signs must be connected to each other. Two
lexical signs are connected if they are directly connected;
furthermore, the connectivity relation is transitive.

Definition 1 Two signs, A and B, are directly connected if there exist
at least two paths, PathA and PathB, such that A:PathA is token
identical with B:PathB.

The indices involved in determining connectivity are specified as
parameters for a particular formalism. For example, in HPSG, they would
be indicated through paths such as SYNSEM:LOCAL:CONTENT:INDEX.
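
As an illustration of Assumption 2 and Definition 1, the following is a minimal sketch, not the authors' implementation. It assumes that each lexical sign has been reduced to the set of index tokens found at its designated index paths; the names `Sign`, `directly_connected` and `bag_is_connected` are hypothetical. Direct connectivity becomes index sharing, and the transitive closure is computed with a union-find structure.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical simplification: a sign is (lexeme, set of index tokens).
Sign = Tuple[str, FrozenSet[int]]


def sign(lexeme: str, *indices: int) -> Sign:
    return (lexeme, frozenset(indices))


def directly_connected(a: Sign, b: Sign) -> bool:
    """Definition 1, simplified: the two signs share at least one index token."""
    return bool(a[1] & b[1])


def bag_is_connected(bag: List[Sign]) -> bool:
    """Assumption 2: every sign reaches every other sign through the
    transitive closure of direct connectivity."""
    if not bag:
        return True
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if directly_connected(bag[i], bag[j]):
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(bag))}) == 1


# the_1 dog_1 with_{1,2} the_2 brown_2 collar_2
connected_bag = [sign("the", 1), sign("dog", 1), sign("with", 1, 2),
                 sign("the", 2), sign("brown", 2), sign("collar", 2)]
# Without 'with', the two groups of signs share no index.
disconnected_bag = [s for s in connected_bag if s[0] != "with"]
```

In this sketch the preposition is the only sign carrying both indices, so removing it splits the bag into two components that no grammar satisfying Assumption 3 could join.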

To ensure that only connected lexical signs are 
generated and analysed, the following assumption 
must also be made: 

Assumption 3 A grammar will only generate or 
analyse connected lexical signs. 

2 Bag Generation Algorithms 

Two main types of rule-based bag generators have been proposed. The
first type consists of a parser suitably relaxed to take into account
the unordered character of the input (Whitelock, 1994; Popowich, 1995;
Trujillo, 1995). For example, in generators based on a chart parser,
the fundamental rule is applied only when the edges to be combined
share no lexical leaves, in contrast to requiring that the two edges
have source and target nodes in common. The other type of generator
applies a greedy algorithm to an initial solution in order to find a
grammatical sentence (Poznański et al., 1995).
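
The relaxed fundamental rule described above can be sketched as follows. This is an illustrative simplification rather than code from any of the cited generators: the `Edge` class and the `combine` function are hypothetical names, edges record only a category and the bag positions they cover, and the check that a grammar rule actually licenses the mother category is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Edge:
    category: str
    leaves: FrozenSet[int]  # positions of the bag elements covered by this edge


def can_combine(left: Edge, right: Edge) -> bool:
    """Applicability test for bag generation: the edges share no lexical leaf.
    An ordinary chart parser would instead require adjacent string positions."""
    return left.leaves.isdisjoint(right.leaves)


def combine(left: Edge, right: Edge, mother: str) -> Optional[Edge]:
    """Build the mother edge if the two edges are compatible, else None.
    Whether a rule licenses `mother` is outside the scope of this sketch."""
    if not can_combine(left, right):
        return None
    return Edge(mother, left.leaves | right.leaves)


# Bag positions: 0=dog, 1=barked, 2=the, 3=brown, 4=big
the = Edge("Det", frozenset({2}))
dog = Edge("N1", frozenset({0}))
np = combine(the, dog, "NP")      # covers positions {0, 2}
clash = combine(np, dog, "NP")    # None: 'dog' would be used twice
```

Representing coverage as a set of bag positions, rather than a pair of string positions, is what lets the same chart machinery operate on unordered input.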



2.1 Redundancy in Bag Generation 

One disadvantage with the above generators is 
that they construct a number of structures which 
need not have been computed at all. In build- 
ing these structures, the generator is effectively 
searching branches of the search space which never 
lead to a complete sentence. Consider the following input bag:

{ dog, barked, the, brown, big} 

Previous researchers (Brew, 1992; Phillips, 1993) have noted that from
such a bag, the following strings are generated but none can form part
of a complete sentence (note that indices are omitted when there is no
possibility of confusion; # indicates that the substring will never be
part of a complete sentence):

Ex. 1 # the dog 

# the dog barked 

# the brown dog 

For simple cases in chart based generators such 
unnecessary strings do not create many problems, 
but for longer sentences, each additional substring 
implies a further branch in the search tree to be 
considered. 

Since the computational complexity of the greedy bag generator
(Poznański et al., 1995) is polynomial (i.e. $O(n^4)$), the effect of
redundant substructures is not as detrimental as for parser based
generators. Nevertheless, a certain amount of unnecessary work is
performed. To show this, consider the test-rewrite sequence for
Example 1:



Test: dog barked the brown big 
Rewrite: barked the dog brown big 
Test: barked (the dog) brown big 
Rewrite: (the dog) barked brown big 
Test: ((the dog) barked) brown big 
Rewrite: the brown dog barked __ big 
Test: ((the (brown dog)) barked) big 
Rewrite: the big (brown dog) barked __ 
Test: ((the (big (brown dog))) barked) (terminate) 

In this sequence double underscore (__) indicates the starting position
of a moved constituent; the moved constituent itself is given in bold
face; the bracketing indicates analysed constituents (for expository
purposes the algorithm has been oversimplified, but the general idea
remains the same).

Now consider the step where 'brown' is inserted 
between 'the' and 'dog'. This action causes the 
complete structure for 'the dog barked' to be dis- 
carded and replaced with that for 'the brown dog 
barked', which in turn is discarded and replaced 
by 'the big brown dog barked'. 

2.2 Previous Work 

A number of pruning techniques have been sug- 
gested to reduce the amount of redundancy in bag 
generators. Brew (1992) proposed a constraint 
propagation technique which eliminates branches 
during bag generation by considering the nec- 
essary functor-argument relationships that exist 
between the component basic signs of categorial 
signs. These relationships form a graph indicat- 
ing the necessary conditions for a lexical item to 
form part of a complete sentence. Such graphs can 
be used to eliminate the substrings in Example 1. 
Unfortunately the technique exploits specific aspects of categorial
grammars and it is not clear how it may be used with other formalisms.



Trujillo (1995) adapts some of Brew's ideas to phrase structure
grammars by compiling Follow functions and constructing adjacency
graphs. While this approach reduces the size of the search space, it
does not prune it sufficiently for certain classes of modifiers.



Phillips (1993) proposes handling inefficiency at the expense of
completeness. His idea is to maintain a queue of modifiable
constituents (e.g. N1s) in order to delay their combination with other
constituents until modifiers (e.g. PPs) have been analysed. While
practical, this approach can lead to alternative valid sentences not
being generated.

3 Connectivity Restrictions 

A mechanism that eliminates unnecessary well-formed substrings (wfss)
can be built on the indices in lexical signs. As mentioned earlier,
these indices play a major role in preventing the generation of
incorrect translations.



1) [CAT = S,  SEM = [0]]  ->  [CAT = NP,   SEM:ARG1 = [1]]         [CAT = VP, SEM = [0][ARG2 = [1]]]
2) [CAT = NP, SEM = [0]]  ->  [CAT = Det,  SEM:ARG1 = [1]]         [CAT = N1, SEM = [0][ARG1 = [1]]]
3) [CAT = N1, SEM = [0]]  ->  [CAT = A,    SEM:ARG1 = [1]]         [CAT = N1, SEM = [0][ARG1 = [1]]]
4) [CAT = N1, SEM = [0]]  ->  [CAT = N1,   SEM:ARG1 = [1]]         [CAT = PP, SEM = [0][ARG1 = [1]]]
5) [CAT = N1, SEM = [0]]  ->  [CAT = N,    SEM = [0]]
6) [CAT = PP, SEM = [0]]  ->  [CAT = P,    SEM = [0][ARG3 = [2]]]  [CAT = NP, SEM:ARG1 = [2]]
7) [CAT = VP, SEM = [0]]  ->  [CAT = Vtra, SEM = [0][ARG3 = [2]]]  [CAT = NP, SEM:ARG1 = [2]]


Figure 1: Simple unification grammar. 

It will be shown that it is possible to exploit the connectivity
Assumption 2 above in order to achieve a reduction in the number of
redundant wfss constructed by both types of generator described in
Section 2.

3.1 Using Connectivity for Pruning 

Take the following bag: 

Ex. 2 {dog_1, the_1, brown_1, big_1} 

(corresponding to 'the big brown dog'). Assume that the next wfss to be
constructed by the generator is the NP 'the dog'. Given the grammar in
Figure 1, it is possible to deduce that 'brown' can never be part of a
complete NP constructed from such a substring. This can be determined
as follows. If this adjective were part of such a sentence, 'brown'
would have to appear as a leaf in some constituent that combines with
'the dog' or with a constituent containing 'the dog'. From the grammar,
the only constituents that can combine with 'the dog' are VP, Vtra and
P. However, none of these constituents can have 'brown_1' as a leaf: in
the case of P and Vtra this is trivial, since they are both categories
of a different lexical type. In the case of the VP, 'brown_1' cannot
appear as a leaf either because expansions of the VP are restricted to
NP complements with 2 as their semantic index, which in turn would also
require adjectives within them to have this index. Furthermore,
'brown_1' cannot occur as a leaf in a deeper constituent in the VP
because such an occurrence would be associated with a different index.
In such cases 'brown' would modify a different noun with a different
index:

Ex. 3 {the_1, dog_1, with_{1,2}, the_2, brown_2, collar_2} 



A naive implementation of this deduction would 
attempt to expand the VP depth-first, left to 
right, in order to accommodate 'brown' in a com- 
plete derivation. Since this would not be possible, 
the NP 'the dog' would be discarded. This ap- 
proach is grossly inefficient however. What is re- 
quired is a more tractable algorithm which, given 
a wfss and its associated sign, will be able to deter- 
mine whether all remaining lexical elements can 
ever form part of a complete sentence which in- 
cludes that wfss. 

Note that deciding whether a lexical sign can 
appear outside a phrase is determined purely by 
the grammar, and not by whether the lexical ele- 
ments share the same index or not. Thus, a more 
complex grammar would allow 'the man' from the 
bag 

Ex. 4 {the_1, man_1, shaves_{e,1,1}, himself_1} 

even though 'himself' has the same index as 'the man'.

3.2 Outer Domains 

The approach introduced here compiles the rel- 
evant information offline from the grammar and 
uses it to check for connectivity during bag gener- 
ation. The compilation process results in a set of 
(Sign, Lex, Bindings) triples called outer domains. 
This set is based on a unification-based phrase 
structure grammar defined as follows: 

Definition 2 A grammar is a tuple (N,T,P,S), where P is a set of
productions $\alpha \rightarrow \beta$, $\alpha$ is a sign, $\beta$ is
a list of signs, N is the set of all $\alpha$, T is the set of all
signs appearing as elements of $\beta$ which unify with lexical
entries, and S is the start sign.

Outer domains are defined as follows: 

Definition 3 { (Sign, Lex, Binds) | $Sign \in N \cup T$, $Lex \in T$
and there exists a derivation
$\alpha \Rightarrow^{*} \beta_1\, Sign'\, \beta_2\, Lex'\, \beta_3$ or
$\alpha \Rightarrow^{*} \beta_1\, Lex'\, \beta_2\, Sign'\, \beta_3$,
and Sign' a unifier for Sign, Lex' a unifier for Lex, and Binds the set
of all path pairs <SignPath,LexPath> such that Sign':SignPath is token
identical with Lex':LexPath }

Intuitively, the outer domains indicate that preterminal category Lex
can appear in a complete sentence with subconstituent Sign, such that
Lex is not a leaf of Sign. Using ideas from data flow analysis
(Kennedy, 1981), predictive parser constructions (Aho et al., 1986) and
feature grammar compilation (Trujillo, 1994) it is possible to
construct such a set of triples. Outer domains thus represent elements
which may lie outside a subtree of category Sign in a complete
sentential derivation. The following definition specifies how outer
domains are used:

Definition 4 A lexical sign Lex' is in the outer domain of Sign' iff
there is a triple (Sign, Lex, Binds) in outer domains such that Sign
and Lex unify with Sign' and Lex' respectively, and there is at least
one pair <PathS,PathL> $\in$ Binds such that Sign':PathS unifies with
Lex':PathL.

In compiling outer domains, inner domains are 
used to facilitate computation. Inner domains are 
defined as follows: 

Definition 5 { (Sign, Lex, Binds) | $Sign \in N \cup T$, $Lex \in T$
and there exists a derivation
$\alpha \Rightarrow^{*} \beta_1\, Lex'\, \beta_2$, with Sign' a unifier
for Sign, Lex' a unifier for Lex, and Binds the set of all path pairs
<SignPath,LexPath> such that Sign':SignPath is token identical with
Lex':LexPath }

The inner domains thus express all the possible 
terminal categories which may be derived from 
each nonterminal in the grammar. 

To be able to exploit connectivity during gen- 
eration, inner and outer domains contain only 
triples in which Binds has at least one element. 
In this way, only those lexical categories which are 
directly connected to the sign are taken into ac- 
count; the implication of this will become clearer 
later. 

As an example, the outer domain of NP as de- 
rived from the above grammar is: 

(NP[sem:arg1:X], Vtra[sem:arg2:Y], {<sem:arg1,sem:arg2>}) 
(NP[sem:arg1:X], Vtra[sem:arg3:Y], {<sem:arg1,sem:arg3>}) 
(NP[sem:arg1:X], P[sem:arg3:Y], {<sem:arg1,sem:arg3>}) 

This set indicates that for any NP, the only ter- 
minal categories not contained in the subtree with 
root NP, and with which the NP shares a seman- 
tic index, are Vtra and P. For instance, the first 
triple arises from the following tree: 



                     S
                /         \
   NP[sem:arg1:X]         VP[sem:arg2:X]
                           /         \
            Vtra[sem:arg2:X]         NP
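
The membership test of Definition 4 can be sketched over the three NP triples listed above. The encoding is an assumption made for illustration only: signs are flat dictionaries from path names to atomic index values, categories are reduced to atoms, and "unifies" is approximated by equality of index values. The names `NP_OUTER` and `in_outer_domain` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Sign = Dict[str, object]  # hypothetical flat encoding: {"cat": ..., path: index}
Triple = Tuple[str, str, Set[Tuple[str, str]]]  # (Sign, Lex, Binds)

# The three triples given for NP, with categories reduced to atoms.
NP_OUTER: List[Triple] = [
    ("NP", "Vtra", {("sem:arg1", "sem:arg2")}),
    ("NP", "Vtra", {("sem:arg1", "sem:arg3")}),
    ("NP", "P", {("sem:arg1", "sem:arg3")}),
]


def in_outer_domain(lex: Sign, sign: Sign, triples: List[Triple]) -> bool:
    """Definition 4, simplified: some triple matches both categories and has
    a path pair whose two values agree (equality stands in for unification)."""
    for sign_cat, lex_cat, binds in triples:
        if sign["cat"] != sign_cat or lex["cat"] != lex_cat:
            continue
        for sign_path, lex_path in binds:
            s_val, l_val = sign.get(sign_path), lex.get(lex_path)
            if s_val is not None and s_val == l_val:
                return True
    return False


the_dog = {"cat": "NP", "sem:arg1": 1}
the_brown_collar = {"cat": "NP", "sem:arg1": 2}
with_1_2 = {"cat": "P", "sem:arg1": 1, "sem:arg3": 2}   # with_{1,2}
brown_1 = {"cat": "A", "sem:arg1": 1}
```

Under this encoding 'with_{1,2}' lies in the outer domain of the NP 'the brown collar' through its third argument, but not in that of the NP 'the dog', and an adjective never lies in the outer domain of an NP.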

3.3 Pruning through Outer Domains and 
Connectivity 

The pruning technique developed here operates 
on grammars whose analyses result in connected 
leaves. 

Consider some wfss W constructed from a bag B 
and with category C; this category, in the form of 
a sign, will include syntactic and lexical-semantic 



information. Such a wfss will have been con- 
structed during the bag generation process. Now, 
either W includes all the input elements as leaves, 
in which case W constitutes a complete sentence, 
or there are elements in the input bag which are 
not part of W. In the latter case, for bags obeying Assumption 2, the
following condition holds for any W that can form part of a complete
sentence:

Condition 1 Let L be the set of leaves appearing in W, let G be the
graph (V,E), where $V = \{C\} \cup (B - L)$, and
$E = \{ \{x,y\} \mid x,y \in V$ and y is in the outer domain of x $\}$.
Then G is connected.

To show that this condition indeed holds, consider a grammatical
ordering of some input bag B, represented as the string W:

$\alpha \ldots \gamma\,\delta \ldots \omega$

By Assumption 2, the lexical elements in the bag, and therefore in any
grammatical ordering of it, are connected. Now consider reducing this
string using the production rule:

$D \rightarrow \gamma\,\delta$

to give the string W': 

$\alpha \ldots D \ldots \omega$

In this case, the signs in W' will also be connected. 
This can be shown by contradiction: 

Proof 1 Assume that there is some sign $\zeta$ in W' to which D is not
connected. Then grammar G would allow disconnected strings to be
generated, contrary to Assumption 3. This is because D would not be
able to rewrite $\gamma\,\delta$ in such a way that both daughters were
connected to $\zeta$, leading to a disconnected string.

The situation in string W' is analogous to that in Condition 1. By
identifying signs which are directly connected in G, it is possible to
determine whether G is connected and consequently whether C can form
part of a complete derivation. Instead of simply comparing the value of
index paths, it is more restrictive to use outer domains since they
give us precisely those elements which are directly connected to a sign
and are in its outer domain.

3.4 Example 

Consider Example 2. To eliminate the wfss 'the dog' from further
consideration, a connected graph of lexical signs is constructed before
generation is started (Figure 2). This graph is built by using the
outer domain of each lexical element to decide which of the remaining
elements could possibly share an index with it in a complete sentence.




Figure 2: Initial connected graph. 

When a new wfss is constructed during genera- 
tion, say by application of the modified fundamen- 
tal rule or during the rewrite phase in a greedy al- 
gorithm, this initial graph is updated and tested 
for connectivity. If the updated graph is not con- 
nected then the proposed wfss cannot form part of 
a complete sentence. Updating the graph involves 
three steps. Firstly every node in the graph which 
is a leaf of the new wfss is deleted, together with 
its associated arcs. Secondly, a new node corre- 
sponding to the new wfss is added to the graph. 
Finally, a new arc is added to the graph between 
the new node and every other node lying in its 
outer domain. The updated (disconnected) graph 
that ensues after constructing 'the dog' is shown 
in Figure 3; this NP is therefore rejected. 

    'the dog'_1        big_1        brown_1

Figure 3: Updated disconnected graph after the wfss 'the dog' is
constructed.
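
The update-and-test procedure of this section can be sketched as below. Several simplifications are assumptions of the sketch rather than details from the paper: outer domains are reduced to (sign category, lexical category) pairs, the NP pairs follow the outer domain listed in Section 3.2 while the remaining pairs were derived by hand from Figure 1, arcs are computed on demand instead of being stored, and node labels are assumed to be unique. All names are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# (sign category, lexical category) pairs standing in for outer-domain triples.
# NP rows: from the outer domain of NP in Section 3.2.
# Other rows: hand-derived from Figure 1 for this sketch (an assumption).
OUTER: Set[Tuple[str, str]] = {
    ("NP", "Vtra"), ("NP", "P"),
    ("N1", "Det"), ("N1", "A"),
    ("Det", "N"), ("Det", "A"), ("A", "N"), ("A", "A"),
}

Node = Tuple[str, str, FrozenSet[int]]  # (label, category, indices)


def linked(x: Node, y: Node) -> bool:
    """Arc test: one node is in the other's outer domain and they share an index."""
    pair_ok = (x[1], y[1]) in OUTER or (y[1], x[1]) in OUTER
    return pair_ok and bool(x[2] & y[2])


def is_connected(nodes: List[Node]) -> bool:
    """Depth-first search from the first node; connected iff all nodes are reached."""
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        current = stack.pop()
        for other in nodes:
            if other not in seen and linked(current, other):
                seen.add(other)
                stack.append(other)
    return len(seen) == len(nodes)


def update(nodes: List[Node], leaves: Set[str], wfss: Node) -> List[Node]:
    """Steps 1 and 2: delete the leaves of the new wfss, then add its node.
    Step 3 (adding arcs) is implicit because `linked` computes arcs on demand."""
    return [n for n in nodes if n[0] not in leaves] + [wfss]


idx = frozenset({1})
bag = [("the", "Det", idx), ("dog", "N", idx),
       ("brown", "A", idx), ("big", "A", idx)]
after_np = update(bag, {"the", "dog"}, ("the dog", "NP", idx))
after_n1 = update(bag, {"brown", "dog"}, ("brown dog", "N1", idx))
```

With these assumptions the initial graph is connected, the graph obtained after building the NP 'the dog' is not (so the edge would be discarded), and the graph obtained after building the N1 'brown dog' remains connected (so the edge would be kept).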



4 Compiling Connectivity 
Domains 

For reasons of space, the computation of outer domains cannot be
described fully here. The broad outline, however, is as follows. First,
the inner domains of the grammar are calculated. This involves the
calculation of the fixed point of set equations, analogous to those
used in the construction of First sets for predictive parsers (Aho et
al., 1986; Trujillo, 1994). Given the inner domains of each category in
the grammar, the construction of the outer domains involves the
computation of the fixed point of set equations relating the outer
domain of a category to the inner domain of its sisters and to the
outer domain of its mother, in a manner analogous to the computation of
Follow sets.



During computation, the set of Binds is mono- 
tonically increased as different ways of directly 
connecting sign and lexeme are found. 
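
The fixed-point style of computation outlined above can be illustrated on the category skeleton of the grammar as read from Figure 1. This is only a sketch under strong assumptions: feature structures and Binds are ignored entirely, so the result is the set of preterminal categories derivable from each nonterminal rather than the inner domains proper, and the names `RULES`, `PRETERMINALS` and `inner_categories` are hypothetical.

```python
from typing import Dict, List, Set

# Category skeleton of Figure 1; feature structures and Binds are ignored.
RULES: Dict[str, List[List[str]]] = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N1"]],
    "N1": [["A", "N1"], ["N1", "PP"], ["N"]],
    "PP": [["P", "NP"]],
    "VP": [["Vtra", "NP"]],
}
PRETERMINALS: Set[str] = {"Det", "A", "N", "P", "Vtra"}


def inner_categories(rules: Dict[str, List[List[str]]]) -> Dict[str, Set[str]]:
    """Iterate the set equations until no set grows (a least fixed point)."""
    inner: Dict[str, Set[str]] = {cat: set() for cat in rules}
    changed = True
    while changed:
        changed = False
        for mother, expansions in rules.items():
            for daughters in expansions:
                for d in daughters:
                    # A preterminal contributes itself; a nonterminal
                    # contributes whatever it is currently known to derive.
                    new = {d} if d in PRETERMINALS else inner[d]
                    if not new <= inner[mother]:
                        inner[mother] |= new
                        changed = True
    return inner


inner = inner_categories(RULES)
```

The loop terminates because each set can only grow and is bounded by the finite set of preterminals; the same monotonic growth is what the text describes for Binds.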

5 Results 

The above pruning technique has been tested on 
bags of different sizes including different combina- 
tions of modifiers. Sentences were generated using 
two versions of a modified chart parser. In one, 
every inactive edge constructed was added to the 
chart. In the other, every inactive edge was tested 
to see if it led to a disconnected graph; if it did, 
then the edge was discarded. The results of the experiment are shown in
Table 1. The implementation was in Prolog on a Sun SparcStation 10; the
generation timings do not include garbage collection time. The grammar
used for the experiment consisted of simplified, feature-based versions
of the ID rules in GPSG; there were 18 rules and 50 lexical entries.
Compilation of the outer domains for these rules took approximately 37
minutes, and the resulting set occupies 40K of memory. In the general
case, however, the size of the outer domains is $O(n^2)$, where n is
the number of distinct signs; this number can be controlled by
employing equivalence classes of different levels of specificity for
pre-terminal and non-terminal signs.

0.1  15
0.4  36
103  2.0  99
0.9  72
67
213  3.9  138
12
133  3.4  123
15  9.0  294  7.2  186
15  17.6  448  11.1  253
17  2.3  126  2.6  105


Table 1: Effect of pruning (times in secs). 

Only one reading was generated for each bag, 
corresponding to one attachment site for PPs. 
The table shows that the technique can yield re- 
ductions in the number of edges (both active and 
inactive) and time taken, especially for longer sen- 
tences, while retaining the overheads at an accept- 
able level. 

6 Conclusion 

A technique for pruning the search space of a bag 
generator has been implemented and its usefulness 
shown in the generation of different types of con- 
structions. The technique relies on a connectivity 
constraint imposed on the semantic relationships 



expressed in the input bag. In order to apply the 
algorithm, outer domains needed to be compiled 
from the grammar; these are used to discard wfss 
by ensuring lexical signs outside a wfss can indeed 
appear outside that string. 

Exploratory work employing adjacency con- 
straints during generation has yielded further im- 
provements in execution time when applied in con- 
junction with the pruner. If extended appropri- 
ately, these constraints could prune the search 
space even further. This work will be reported 
at a later date. 